TurboBias: Universal ASR Context-Biasing powered by GPU-accelerated Phrase-Boosting Tree

Andrusenko, Andrei, Bataev, Vladimir, Grigoryan, Lilit, Lavrukhin, Vitaly, Ginsburg, Boris

arXiv.org Artificial Intelligence

Recognizing specific key phrases is an essential task for contextualized Automatic Speech Recognition (ASR). However, most existing context-biasing approaches have limitations: they require additional model training, significantly slow down decoding, or constrain the choice of ASR system type. This paper proposes a universal ASR context-biasing framework that supports all major model types: CTC, Transducers, and Attention Encoder-Decoder models. The framework is based on a GPU-accelerated word-boosting tree, which enables shallow-fusion use with greedy and beam-search decoding without noticeable speed degradation, even with a vast number of key phrases (up to 20K items). The obtained results show the high efficiency of the proposed method, which surpasses the considered open-source context-biasing approaches in both accuracy and decoding speed. Our context-biasing framework is open-sourced as part of the NeMo toolkit. Modern end-to-end automatic speech recognition (ASR) systems, such as Connectionist Temporal Classification (CTC) [1], Recurrent Neural Transducer (RNN-T) [2], and Attention Encoder-Decoder (AED) [3], already achieve relatively high speech recognition accuracy in common data domains [4].
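The core data structure here, a phrase-boosting prefix tree used in shallow-fusion decoding, can be sketched in a few lines. This is a minimal illustrative sketch, not NeMo's actual implementation: the class and method names (`PhraseTrie`, `step`, `boost`) and the per-token bonus scheme are assumptions for exposition.

```python
# Minimal sketch of a phrase-boosting trie for shallow-fusion context
# biasing. Hypothetical names and scoring; the real framework runs this
# batched on GPU across beam hypotheses.

class PhraseTrie:
    """Prefix tree over the token sequences of key phrases."""
    def __init__(self, boost=2.0):
        self.root = {}
        self.boost = boost  # per-token score bonus added during decoding

    def add_phrase(self, tokens):
        node = self.root
        for t in tokens:
            node = node.setdefault(t, {})
        node["$"] = True  # end-of-phrase marker

    def step(self, state, token):
        """Advance the trie state by one decoded token.

        Returns (new_state, score_delta). A token that extends a key
        phrase earns +boost; otherwise we fall back to the root and
        check whether a phrase could restart at this token.
        """
        node = state if state is not None else self.root
        if token in node:
            return node[token], self.boost
        if token in self.root:  # a phrase may restart here
            return self.root[token], self.boost
        return self.root, 0.0


trie = PhraseTrie(boost=2.0)
trie.add_phrase(["nvi", "dia"])  # illustrative subword units
trie.add_phrase(["ne", "mo"])

# Greedy decoding with shallow fusion: the trie bonus would be added to
# each candidate token's acoustic log-probability before the argmax.
state, total_bonus = None, 0.0
for hyp_token in ["ne", "mo", "runs"]:
    state, delta = trie.step(state, hyp_token)
    total_bonus += delta
print(total_bonus)  # "ne" and "mo" matched, "runs" did not -> 4.0
```

Because each hypothesis only carries a trie-node pointer, the cost per decoding step is independent of the number of key phrases, which is what makes very large lists (tens of thousands of items) feasible.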


An Effective Context-Balanced Adaptation Approach for Long-Tailed Speech Recognition

Wang, Yi-Cheng, Pai, Li-Ting, Yan, Bi-Cheng, Wang, Hsin-Wei, Lin, Chi-Han, Chen, Berlin

arXiv.org Artificial Intelligence

End-to-end (E2E) automatic speech recognition (ASR) models have become standard practice in various commercial applications. However, in real-world scenarios, the long-tailed nature of word distributions often leads E2E ASR models to perform well on common words but fall short in recognizing uncommon ones. Recently, the notion of a contextual adapter (CA) was proposed to infuse external knowledge, represented by a context word list, into E2E ASR models. Although CA can improve recognition performance on rare words, two crucial data imbalance problems remain. First, when low-frequency words are used as context words during training, these words rarely occur in the utterance, so CA becomes prone to overfitting its attention on that token because higher-frequency words are absent from the context list. Second, the long-tailed distribution within the context list itself still causes the model to perform poorly on low-frequency context words. In light of this, we explore in depth how altering the frequency distribution of words in the context list affects model performance, and we extend CA with a simple yet effective context-balanced learning objective. A series of experiments on the AISHELL-1 benchmark dataset suggests that using all vocabulary words from the training corpus as the context list, paired with our balanced objective, yields the best performance, with a significant character error rate (CER) reduction of up to 1.21% and a more pronounced 9.44% reduction in the error rate on zero-shot words.
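One common way to realize a frequency-balanced objective of this kind is to weight the loss on each context word inversely to its corpus frequency. The sketch below uses the "effective number of samples" class-balanced weighting as a stand-in; the function name, the weighting formula, and the example counts are illustrative assumptions, not the paper's exact formulation.

```python
# Illustrative sketch of frequency-balanced loss weights for context
# words: rare words in the context list receive larger weights so the
# adapter does not underfit them. Hypothetical formulation.

from collections import Counter

def balanced_weights(context_words, corpus_counts, beta=0.999):
    """Class-balanced weights w = (1 - beta) / (1 - beta**n), where n is
    the word's corpus count, normalized to mean 1 over the context list."""
    weights = {}
    for w in context_words:
        n = max(corpus_counts.get(w, 0), 1)
        weights[w] = (1 - beta) / (1 - beta ** n)
    mean = sum(weights.values()) / len(weights)
    return {w: v / mean for w, v in weights.items()}

# Toy corpus counts: one frequent, one mid-frequency, one rare word.
counts = Counter({"taxi": 5000, "metro": 800, "zhongguancun": 3})
w = balanced_weights(["taxi", "metro", "zhongguancun"], counts)
# The rare word receives the largest weight:
assert w["zhongguancun"] > w["metro"] > w["taxi"]
```

These per-word weights would then scale the training loss at positions where the corresponding context word is the target, counteracting the long tail inside the context list.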


Approximate Nearest Neighbour Phrase Mining for Contextual Speech Recognition

Bleeker, Maurits, Swietojanski, Pawel, Braun, Stefan, Zhuang, Xiaodan

arXiv.org Artificial Intelligence

This paper presents an extension for training end-to-end Context-Aware Transformer Transducer (CATT) models using a simple yet efficient method of mining hard negative phrases from the latent space of the context encoder. During training, given a reference query, we mine a number of similar phrases using approximate nearest neighbour search. These sampled phrases are then used as negative examples in the context list, alongside random and ground-truth contextual information. By including approximate nearest neighbour phrases (ANN-P) in the context list, we encourage the learned representation to disambiguate between similar, but not identical, biasing phrases. This improves biasing accuracy when there are several similar phrases in the biasing inventory. We carry out experiments in a large-scale data regime, obtaining up to 7% relative word error rate reductions on the contextual portion of the test data. We also extend and evaluate the CATT approach in streaming applications.
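The mining step itself is conceptually simple: embed all candidate phrases with the context encoder, then retrieve the neighbours of the reference query and treat them as hard negatives. The sketch below uses exact cosine-similarity search over toy vectors in place of a real ANN index, and all names (`mine_hard_negatives`, the example embeddings) are illustrative assumptions.

```python
# Sketch of hard-negative phrase mining via nearest-neighbour search in
# the context encoder's embedding space. Exact search stands in for the
# approximate (ANN) index that would be used at scale.

import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def mine_hard_negatives(query_emb, phrase_embs, k=2):
    """Return the k phrases closest to the reference query embedding,
    excluding the query itself (similarity ~1.0)."""
    scored = [(name, cosine(query_emb, e)) for name, e in phrase_embs.items()]
    scored = [(n, s) for n, s in scored if s < 0.999]  # drop the query
    scored.sort(key=lambda x: -x[1])
    return [n for n, _ in scored[:k]]

# Toy embeddings: "jon smyth" is close to "john smith" in latent space,
# exactly the kind of confusable phrase worth training against.
embs = {
    "john smith": [1.0, 0.1, 0.0],
    "jon smyth":  [0.9, 0.2, 0.1],
    "taxi stand": [0.0, 0.1, 1.0],
}
negs = mine_hard_negatives(embs["john smith"], embs, k=1)
print(negs)  # ['jon smyth'] -- similar but not identical: a hard negative
```

Adding such neighbours to the training context list forces the model to attend to the correct phrase even when near-duplicates are present, which is precisely the failure mode in large biasing inventories.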


A Light-weight contextual spelling correction model for customizing transducer-based speech recognition systems

Wang, Xiaoqiang, Liu, Yanqing, Zhao, Sheng, Li, Jinyu

arXiv.org Artificial Intelligence

It's challenging to customize a transducer-based automatic speech recognition (ASR) system with context information which is dynamic and unavailable during model training. In this work, we introduce a light-weight contextual spelling correction model to correct context-related recognition errors in transducer-based ASR systems. We incorporate the context information into the spelling correction model with a shared context encoder and use a filtering algorithm to handle large-size context lists.

In this work, we propose a novel contextual biasing method which leverages contextual information by adding a contextual spelling correction (CSC) model on top of the transducer model. To consider contextual information during correction, a context encoder which encodes context phrases into hidden embeddings is added to the spelling correction model [16, 17]; the decoder of the correction model then attends to the context encoder and text encoder by an attention mechanism [18].
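The filtering idea mentioned above can be sketched as a pre-selection pass: before the CSC model runs, prune a large context list down to phrases that plausibly match something in the first-pass hypothesis. The criterion below (character-level similarity via `difflib.SequenceMatcher`) and all names are assumptions for illustration; the paper's actual filtering algorithm may differ.

```python
# Hypothetical sketch of context-list filtering for a contextual
# spelling-correction (CSC) model: keep only phrases whose surface form
# resembles some word in the first-pass ASR hypothesis.

from difflib import SequenceMatcher

def filter_context(hypothesis, context_phrases, threshold=0.6, top_k=3):
    """Keep phrases whose similarity to some hypothesis word exceeds the
    threshold; only these survivors are fed to the CSC decoder."""
    kept = []
    for phrase in context_phrases:
        best = max(
            SequenceMatcher(None, phrase.lower(), word.lower()).ratio()
            for word in hypothesis.split()
        )
        if best >= threshold:
            kept.append((best, phrase))
    kept.sort(reverse=True)          # most similar phrases first
    return [p for _, p in kept[:top_k]]

phrases = ["Gulliver", "Goliath", "weather", "Waze"]
hyp = "call goulliver now"          # first-pass hypothesis with an error
print(filter_context(hyp, phrases))  # ['Gulliver']
```

Shrinking the candidate set this way is what keeps the correction model light-weight even when the full context list contains thousands of entries.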